YouTube Trend Exploration by Soumya Ghosh

This report explores a dataset containing the video ID, view count, number of likes, subscriber count and other attributes for approximately 4,500 YouTube trending videos.

Univariate Plots Section

## [1] 4547   23
## 'data.frame':    4547 obs. of  23 variables:
##  $ video_id                   : Factor w/ 4547 levels "-_jlqATo9eo",..: 322 281 548 3186 1308 1796 374 2794 2271 3749 ...
##  $ last_trending_date         : Factor w/ 110 levels "2017-11-14","2017-11-15",..: 7 7 7 7 6 7 5 6 2 2 ...
##  $ publish_date               : Factor w/ 211 levels "2006-07-23","2008-04-05",..: 100 100 99 100 99 100 99 99 100 100 ...
##  $ publish_hour               : Factor w/ 24 levels "0","1","2","3",..: 18 8 20 12 19 20 6 22 15 14 ...
##  $ category_id                : Factor w/ 16 levels "1","2","10","15",..: 8 10 9 10 10 14 10 14 1 11 ...
##  $ channel_title              : Factor w/ 1905 levels "12 News","1MILLION Dance Studio",..: 274 938 1418 645 1215 751 1443 393 4 1834 ...
##  $ views                      : int  2564903 6109402 5315471 913268 2819118 1038365 2688797 1251577 2671756 635985 ...
##  $ likes                      : int  96321 151250 187303 16729 153395 22594 19042 28951 12699 20721 ...
##  $ dislikes                   : int  7972 11508 7278 1386 2416 2798 3059 1146 505 2417 ...
##  $ comment_count              : int  22149 19820 9990 2988 20573 3142 2689 2606 1010 4111 ...
##  $ comments_disabled          : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ratings_disabled           : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tag_appeared_in_title_count: int  0 0 8 3 1 2 4 4 6 2 ...
##  $ tag_appeared_in_title      : logi  FALSE FALSE TRUE TRUE TRUE TRUE ...
##  $ title                      : Factor w/ 4540 levels "'Big one' knocks out several heavy-hitters, sends Daytona 500 to OT",..: 4264 3905 3167 2891 1854 107 3298 182 3777 4445 ...
##  $ tags                       : Factor w/ 4190 levels "#MeToo|Grammys 2018|Janelle Monáe|Kesha",..: 3195 2058 2933 3042 3092 1654 3286 45 3827 4001 ...
##  $ description                : Factor w/ 4416 levels "","'A curious cat helps his owner with home improvements.'\\nWe're releasing a NEW BLACK & WHITE episode every wee"| __truncated__,..: 3136 2766 4111 3965 1675 3445 1024 1734 1897 1187 ...
##  $ trend_day_count            : int  7 7 7 7 6 7 5 6 2 2 ...
##  $ trend.publish.diff         : int  7 7 8 7 7 7 6 7 2 2 ...
##  $ trend_tag_highest          : int  2 65 68 488 488 38 488 113 151 39 ...
##  $ trend_tag_total            : int  2 69 426 1246 1007 122 2216 180 458 170 ...
##  $ tags_count                 : int  1 4 23 28 14 7 42 13 28 20 ...
##  $ subscriber                 : int  9086142 5937292 4191209 13186408 20563106 4652602 5292034 10474796 2453494 3808198 ...
##         video_id     last_trending_date     publish_date   publish_hour 
##  -_jlqATo9eo:   1   2018-03-05: 200     2018-02-05:  71   17     : 408  
##  -0NYY8cqdiQ:   1   2018-01-09: 141     2017-12-13:  70   16     : 404  
##  -1yT-K3c6YI:   1   2018-02-01:  85     2017-12-12:  67   15     : 342  
##  -2b4qSoMnKE:   1   2017-12-13:  70     2018-01-29:  66   18     : 324  
##  -2RVw2_QyxQ:   1   2017-11-14:  69     2017-11-15:  62   14     : 319  
##  -2wRFv-mScQ:   1   2017-11-22:  68     2018-01-26:  61   20     : 246  
##  (Other)    :4541   (Other)   :3914     (Other)   :4150   (Other):2504  
##   category_id                                  channel_title 
##  24     :1102   The Tonight Show Starring Jimmy Fallon:  49  
##  10     : 568   TheEllenShow                          :  44  
##  25     : 436   ESPN                                  :  41  
##  26     : 413   Netflix                               :  41  
##  23     : 380   Jimmy Kimmel Live                     :  39  
##  22     : 352   Refinery29                            :  39  
##  (Other):1296   (Other)                               :4294  
##      views               likes            dislikes       comment_count    
##  Min.   :      559   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:    90896   1st Qu.:   1486   1st Qu.:     76   1st Qu.:    226  
##  Median :   318840   Median :   7397   Median :    291   Median :    854  
##  Mean   :  1265665   Mean   :  39197   Mean   :   2617   Mean   :   4939  
##  3rd Qu.:  1006673   3rd Qu.:  25576   3rd Qu.:   1023   3rd Qu.:   2862  
##  Max.   :149376127   Max.   :3093544   Max.   :1674420   Max.   :1361580  
##                                                                           
##  comments_disabled ratings_disabled tag_appeared_in_title_count
##  False:4471        False:4522       Min.   : 0.000             
##  True :  76        True :  25       1st Qu.: 1.000             
##                                     Median : 3.000             
##                                     Mean   : 2.961             
##                                     3rd Qu.: 4.000             
##                                     Max.   :18.000             
##                                                                
##  tag_appeared_in_title
##  Mode :logical        
##  FALSE:701            
##  TRUE :3846           
##                       
##                       
##                       
##                       
##                                                                                           title     
##  DORITOS BLAZE vs. MTN DEW ICE | Super Bowl Commercial with Peter Dinklage and Morgan Freeman:   2  
##  Justice League - Movie Review                                                               :   2  
##  Maroon 5 - Wait                                                                             :   2  
##  Missouri Star Quilt Company Live Stream                                                     :   2  
##  NBA Bloopers - The Starters                                                                 :   2  
##  Selena Gomez, Marshmello - Wolves                                                           :   2  
##  (Other)                                                                                     :4535  
##                                                                                                                                                                                                                                                                                                     tags     
##  The Late Show|Stephen Colbert|Colbert|Late Show|celebrities|late night|talk show|skits|bit|monologue|The Late Late Show|Late Late Show|letterman|david letterman|comedian|impressions|CBS|joke|jokes|funny|funny video|funny videos|humor|celebrity|celeb|hollywood|famous|James Corden|Corden|Comedy:  25  
##  James Corden|The Late Late Show|Colbert|late night|late night show|Stephen Colbert|Comedy|monologue|comedian|impressions|celebrities|carpool|karaoke|CBS|Late Late Show|Corden|joke|jokes|funny|funny video|funny videos|humor|celebrity|celeb|hollywood|famous                                      :  23  
##  Viral|Video|Epic                                                                                                                                                                                                                                                                                     :  11  
##  cupcakes|how to make vanilla cupcakes|over the top recipes|easy cupcake recipes|vanilla cupcakes|chocolate cupcakes|french macarons|how to make macarons|the scran line|the scranline|nick makrides|pastry design|how to pipe cupcakes                                                               :   7  
##  nba|basketball|starters                                                                                                                                                                                                                                                                              :   7  
##  (Other)                                                                                                                                                                                                                                                                                              :4266  
##  NA's                                                                                                                                                                                                                                                                                                 : 208  
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                description  
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      :  89  
##  Jukin Media Verified (Original) * For licensing / permission to use: Contact - licensing(at)jukinmediadotcom\\nSubmit your videos here: http://bit.ly/2iFnUya                                                                                                                                                                                                                                                                                                                                                                                                                       :  11  
##  ► Listen LIVE: http://power1051fm.com/\\n► Facebook: https://www.facebook.com/Power1051NY/\\n► Twitter: https://twitter.com/power1051/\\n► Instagram: https://www.instagram.com/power1051/                                                                                                                                                                                                                                                                                                                                                                                  :  10  
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      :   4  
##  To get this complete recipe with instructions and measurements, check out my website: http://www.LauraintheKitchen.com\\n\\nInstagram: http://www.instagram.com/mrsvitale\\n\\nOfficial Facebook Page: http://www.facebook.com/LauraintheKitchen\\n\\nContact: Business@LauraintheKitchen.com\\n\\nTwitter: @Lauraskitchen                                                                                                                                                                                                                                                          :   4  
##  Get Cut swag here: http://cut.com/shop\\n\\nDon’t forget to subscribe and follow us!\\nYouTube: http://cut.com/youtube \\nFacebook: http://cut.com/facebook \\nInstagram: http://cut.com/instagram \\nSnapchat: @watchcut\\n\\nProduced, directed, and edited by https://cut.com \\n\\nWant to work with us? http://cut.com/hiring \\nWant to be in a video? http://cut.com/casting \\nLove Cut? Fill out this form for exclusive updates: http://cut.com/fanform \\n\\nWant to sponsor a video? http://cut.com/sponsorships  \\nFor licensing inquiries: http://cut.com/licensing:   3  
##  (Other)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             :4426  
##  trend_day_count  trend.publish.diff trend_tag_highest trend_tag_total 
##  Min.   : 1.000   Min.   :   0.00    Min.   :  0.0     Min.   :   0.0  
##  1st Qu.: 3.000   1st Qu.:   5.00    1st Qu.: 22.0     1st Qu.:  68.0  
##  Median : 5.000   Median :   6.00    Median : 85.0     Median : 217.0  
##  Mean   : 4.831   Mean   :  34.43    Mean   :130.3     Mean   : 437.9  
##  3rd Qu.: 7.000   3rd Qu.:   7.00    3rd Qu.:151.0     3rd Qu.: 515.0  
##  Max.   :14.000   Max.   :4215.00    Max.   :488.0     Max.   :3644.0  
##                                                                        
##    tags_count      subscriber      
##  Min.   : 0.00   Min.   :       0  
##  1st Qu.: 9.00   1st Qu.:  246647  
##  Median :18.00   Median : 1198769  
##  Mean   :19.21   Mean   : 3164303  
##  3rd Qu.:29.00   3rd Qu.: 3766915  
##  Max.   :69.00   Max.   :28676937  
##                  NA's   :22

Our dataset consists of 23 variables, with 4547 observations.

[Note: I will create a new variable in the Bivariate Plots section.]

## Time difference of 105 days

So the YouTube trending videos were observed over a total of 105 days.
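A minimal sketch of how such a span can be computed in base R, assuming the dates are stored as "YYYY-MM-DD" factors as in the str() output above. The vector `dates` is made-up illustration data, not the report's.

```r
# Hypothetical trending dates stored as a factor (illustration only)
dates <- factor(c("2018-01-03", "2018-01-01", "2018-01-10"))

# Convert factor -> character -> Date before doing date arithmetic
d <- as.Date(as.character(dates))

# Subtracting two Date objects yields a difftime measured in days
span <- max(d) - min(d)
print(span)
```

Going through as.character() first makes the conversion explicit, so the arithmetic is done on the date labels rather than on the factor's internal codes.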

The peak hours for publishing a trending video on YouTube fall between 14:00 and 18:00 (USA time zone). However, it may be that most videos, not only trending ones, are published during these hours.

Distribution of trending videos by YouTube category:

## 
##    1    2   10   15   17   19   20   22   23   24   25   26   27   28   29 
##  228   66  568  113  306   49   53  352  380 1102  436  413  175  291   13 
##   43 
##    2

Clearly, category_id = 24 is the category with the highest number (1,102) of YouTube trending videos.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0   246647  1198769  3164303  3766915 28676937       22

There are 22 trending videos whose subscriber information is NA.

## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
## 208  51  49  94 141 136 113 160 153 130 152 144 117 134 108 124 108 103 
##  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 
##  95 116 104  98 109 109 105  89 123 110  97  84 143  97  90  97  93  81 
##  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53 
##  80  61  65  37  32  25  28  23  22  16  12  11   8  12   9  13   2   3 
##  54  55  56  57  58  59  61  62  63  65  69 
##   2   3   2   4   2   1   2   4   1   1   1

We can see that 208 videos do not use any tags.

# which.max() returns the index of the largest count in the frequency table
# built from the 'tags_count' column. Indexing the table with that index
# gives the most frequent value (the mode) together with its frequency.

table(YtUsa$tags_count)[which.max(table(YtUsa$tags_count))]
##   0 
## 208

The mode of tags_count is 0.
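The same idea can be wrapped in a small helper. This is only a sketch: the function name `stat_mode` and the sample vector are hypothetical and not part of the original analysis.

```r
# Return the most frequent value of a vector as a character string.
# If several values tie, which.max() picks the first one in table order.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

tags_count <- c(0, 0, 0, 5, 5, 18, 20)  # made-up values for illustration
stat_mode(tags_count)
```

Note that the helper returns the label of the table cell, so the result is a character value and needs as.numeric() if a number is required.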

## [1] 18

Median tags_count is 18.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   2.961   4.000  18.000

From the histogram and the summary, we can see that about 68% of the YouTube trending videos have between 1 (Q1) and 4 (Q3) tags that also appear in the video title.
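For a discrete variable with many ties, the share of observations lying between Q1 and Q3 (inclusive) can be well above 50%. A small sketch of how that share can be checked, using made-up counts rather than the report's data:

```r
# Made-up counts of tags that also appear in the title
x <- c(0, 1, 1, 1, 2, 3, 4, 4, 4, 7)

# First and third quartiles (R's default quantile type 7)
q <- quantile(x, c(0.25, 0.75))

# Proportion of observations between Q1 and Q3, both ends included
share <- mean(x >= q[1] & x <= q[2])
share
```

Here Q1 = 1 and Q3 = 4, and 8 of the 10 values fall in that range, so the share is 0.8 rather than 0.5.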

##    Mode   FALSE    TRUE 
## logical     701    3846

Out of 4,547 trending videos, 701 do not use any of their tags (keywords) in the video title.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.831   7.000  14.000
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
## 604 455 484 480 611 698 602 279 153  63  53  46  18   1

Over the 105 days of observation, no video trended for more than 14 days.

We can also see that 604 videos appeared in the YouTube trending list only once.

The long-tailed distribution of comment counts is transformed for easier reading. Since we use a log10 transformation on the x-axis for comment_count, we plot comment_count + 1 to avoid infinite values (log10(0) = -Inf).
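A short sketch of why the +1 shift is needed, using made-up comment counts:

```r
comment_count <- c(0, 9, 99, 999)  # made-up values, including a zero

# The raw transform sends the zero count to -Inf
raw_log <- log10(comment_count)

# Shifting by 1 keeps every value finite: 0 1 2 3
shifted_log <- log10(comment_count + 1)

raw_log
shifted_log
```

The shift barely changes large counts but moves a count of 0 to log10(1) = 0, so it can be drawn on the axis.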

## False  True 
##  4471    76

There are only 76 videos where the comments_disabled variable is set to True.

We can see that most YouTube videos are listed as trending within 0 to 14 days of their publishing date.

Trending data for 2 days are missing in mid-January.

We can clearly see that trend_tag_highest has a multi-modal distribution, while the trend_tag_total distribution is long-tailed and positively (right) skewed. Neither distribution is symmetric.

log10() is applied to the x scale of both plots.

Next, I plot the distributions of some key features of the dataset with a log10 transformation.

Univariate Analysis

What is the structure of your dataset?

There are 4,547 unique video IDs (observations) in the dataset, with 23 features. 15 of them are independent features and the rest are dependent features. “video_id” is the unique identifier of the dataset.

The trending report was recorded over 105 days. Category ID 24 is the most common category for trending videos: 1,102 of 4,547 videos were published under it. Over the 105 days of observation, no video appeared on the trending list more than 14 times. Most videos are listed on the YouTube trending list within 0 to 14 days of their publishing date. About 85% of trending videos (3,846 of 4,547) use at least one of their tags in the video title.

The median view count for a YouTube trending video is 318,840. The median comment count is 854. The maximum number of dislikes received by a trending video is 1,674,420. The median subscriber count is 1,198,769. 208 trending videos did not include any tags. 604 videos trended only once, meaning they never re-trended within the 105-day period.

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are views, comment_count, likes, dislikes and subscriber. I would like to determine which features are best for predicting the views of a YouTube trending video. I suspect that comment_count, likes, dislikes, subscriber and some combination of the other features could be used to build a predictive model of the view count of a YouTube trending video.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features in the dataset are: category_id; tag_appeared_in_title; trend_tag_highest (the maximum number of times any single tag used on the video is used across all trending videos); trend_tag_total (the total number of times the tags used on the video are used across all trending videos); and trend_day_count (the number of times a video was listed on YouTube Trending).

Did you create any new variables from existing variables in the dataset?

Yes, I created 5 new variables derived from the dataset: tag_appeared_in_title_count, tag_appeared_in_title, trend_tag_highest, trend_tag_total and tags_count.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

During my investigation, I found only 76 observations where comments_disabled is set to True, and only 25 observations where ratings_disabled is set to True. These numbers are very low relative to the total number of observations. I think these two are important features of the dataset, but because so few cases are available, we cannot confidently draw conclusions from them.

Yes, I performed operations on the original dataset to make it tidy, so this dataset is a modified version of the original. The original dataset contained multiple observations for the same video_id. It could be analysed as a time series, but as the feature “trend_day_count” shows, not every video_id was repeated in YouTube trending, so a time series is not available for every video. For a time-series analysis, I would have had to filter on trend_day_count > 1, which would remove 604 trending videos. For this project, I wanted to observe every trending video.
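The tidying code itself is not shown in this report. The following is only a sketch of one way to collapse repeated rows into one row per video in base R. It assumes a hypothetical raw table `raw` with columns `video_id`, `trending_date` and `views`, and it assumes that the most recent row of each video is the one kept; neither assumption is confirmed by the source.

```r
# Hypothetical raw data: one row per video per trending day
raw <- data.frame(
  video_id      = c("a", "a", "a", "b", "c", "c"),
  trending_date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03",
                            "2018-01-01", "2018-01-05", "2018-01-06")),
  views         = c(100, 250, 400, 50, 10, 30)
)

# Number of trending days recorded for each video
day_count <- table(raw$video_id)

# Sort by video and date, then keep only the latest row of each video
raw  <- raw[order(raw$video_id, raw$trending_date), ]
tidy <- raw[!duplicated(raw$video_id, fromLast = TRUE), ]

# Attach the per-video day count as a new column
tidy$trend_day_count <- as.integer(day_count[as.character(tidy$video_id)])
tidy
```

With this layout every video_id appears exactly once, and videos that trended on a single day are kept with trend_day_count equal to 1.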

Bivariate Plots Section

One interesting point: the correlation between likes and comment_count is 0.71, while the correlation between dislikes and comment_count is 0.83. This suggests that people engage in conversation more when they dislike a video than when they like it. In many of these cases the video might be controversial, fake news, etc.

## [1] 0.8209508

From the above plot we can see a strong relationship between views and likes; the correlation between them is 0.82.

Since log10 is applied on the x-axis and a few videos in the YouTube trending list have 0 likes, we plot (likes + 1) instead of likes on the scale_x_log10() axis. This avoids infinite values, since log10(0) = -Inf.

Therefore, a value of 1 on the x-axis of the above plot represents 0 likes.

We can see many outliers on the y-axis at x = 1. Many of those video authors might have disabled ratings, so users cannot like or dislike the video.

Another point to note: beyond 10^4 = 10,000 likes, the variance of likes decreases as views increase.

This plot looks almost the same as the views vs. likes plot, but here the variance of dislikes is larger in some places.

## [1] 0.5289388

Correlation between views and dislikes = 0.53.

## [1] -0.02158679

Correlation between views and ratio of the likes & dislikes is very weak.

## [1] 0.7128881

Correlation between views and comment_count is very strong.

Trending videos with a description length below 500 have a lower mean (average) view count than others. [Note: only data up to the 95th percentile are shown.]

From the above plot, the linear regression line shows that as the average length of video titles increases, the average view count decreases slightly.

tag_appeared_in_title_count has little effect on the view count. [Note: the two-tailed 95% range is shown.]
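The "95%" restrictions mentioned above are read here as percentile-based trimming of the plotted data, which is an assumption: one-sided trimming keeps values at or below the 95th percentile, and two-sided trimming keeps values between the 2.5th and 97.5th percentiles. A sketch with made-up skewed data:

```r
set.seed(1)
# Made-up skewed view counts with one extreme value
views <- c(rexp(99, rate = 1 / 1000), 1e7)

# One-sided: keep values at or below the 95th percentile
upper_trim <- views[views <= quantile(views, 0.95)]

# Two-sided: keep the central 95% of the values
lims    <- quantile(views, c(0.025, 0.975))
central <- views[views >= lims[1] & views <= lims[2]]

length(upper_trim)
length(central)
```

Both variants drop the extreme value, which keeps a few very large view counts from dominating the plot scale.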

## [1] 0.02458608

As expected, the correlation between views and tag_appeared_in_title_count is 0.02, which is very weak.

## [1] 0.1904766

For the data up to the 95th percentile of views and trend_day_count, as the mean trend_day_count increases, the mean view count increases rapidly.

Correlation between views and trend_day_count = 0.19

## [1] 0.3594205

Rank correlation (method = “spearman”) between views and trend_day_count is 0.3594205.

Relationship between video category_id and view counts:

## [1] -0.116636

The value of the correlation is -0.116636 (approximately -0.12).

# Correlation between views and category_id, restricted to videos whose
#   views are at or below the 95th percentile.
# category_id is a factor, so as.numeric() converts it to its integer level
#   codes for the correlation calculation.

with(subset(YtUsa,views >= quantile(views,0.0) & views <= 
              quantile(views,0.95)),cor(views,as.numeric(category_id)))
## [1] -0.1038592

Correlation between views (up to the 95th percentile) and category_id = -0.1038592
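One caveat when converting a factor with as.numeric(): the result is the internal level code, not the number written in the label. A small sketch with made-up category labels:

```r
# Made-up category labels stored as a factor
category_id <- factor(c("24", "10", "1", "24"))

# Level codes: position of each value among the sorted levels
codes <- as.numeric(category_id)

# Original ids: convert to character first, then to numeric
ids <- as.numeric(as.character(category_id))

codes
ids
```

Either way, category ids are nominal labels, so a correlation computed on them mostly reflects how the categories happen to be numbered and should be read with caution.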

To get a better sense of the distribution across categories, let's create a new variable called “subscriber_by_category”, which splits the data into 3 groups.

low: all category_ids whose maximum subscriber count is small (<= 7552015).

medium: all category_ids whose maximum subscriber count is moderate (from 7552016 to 18185017).

high: all category_ids whose maximum subscriber count is huge (> 18185017).

Here, low < medium < high

# subscriber contains NA values, so work on the subset without them,
# grouped by category_id (group_by, summarise and arrange come from dplyr).
library(dplyr)

cat_groups <- group_by(subset(YtUsa,!is.na(subscriber)),category_id)
YtUsa.subs_by_cat <- summarise(cat_groups,subs_max = max(subscriber),n=n())
YtUsa.subs_by_cat <- arrange(YtUsa.subs_by_cat,subs_max)

# YtUsa.subs_by_cat is a new data frame with 3 columns:
#   category_id, subs_max, n
# subs_max is the highest channel subscriber count within that YouTube
#   category_id
# n is the number of YouTube trending videos in the category

# Create the new variable, initialised to NA for every row
YtUsa$subscriber_by_category <- NA

# Now assign values based on YtUsa.subs_by_cat and the thresholds above

YtUsa[YtUsa$category_id %in% 
        YtUsa.subs_by_cat[YtUsa.subs_by_cat$subs_max<=7552015,]
      $category_id,]$subscriber_by_category <- "low"

YtUsa[YtUsa$category_id %in% 
        YtUsa.subs_by_cat[YtUsa.subs_by_cat$subs_max> 7552015 & 
                            YtUsa.subs_by_cat$subs_max<=18185017,]
      $category_id,]$subscriber_by_category <- "medium"

YtUsa[YtUsa$category_id %in% 
        YtUsa.subs_by_cat[YtUsa.subs_by_cat$subs_max>18185017,]
      $category_id,]$subscriber_by_category <- "high"

# convert to an ordered factor with low < medium < high
YtUsa$subscriber_by_category <- 
  factor(YtUsa$subscriber_by_category,
         levels=c("low","medium","high"),ordered=TRUE)

# cor() needs numeric input, so the ordered levels low, medium, high are
# converted to the numbers 1, 2, 3.

with(YtUsa,cor(views,as.numeric(subscriber_by_category)))
## [1] 0.09658316

The correlation between views and subscriber_by_category is 0.10.

So categories with higher subscriber levels tend to get somewhat more YouTube views, although the correlation is weak.
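The three-way grouping can also be expressed with cut(), which bins a numeric vector in one call. This is an alternative sketch, not the code used above; the `subs_max` values below are made up, and only the two thresholds are taken from the report.

```r
# Made-up per-category maximum subscriber counts
subs_max <- c(5e6, 7552015, 1.2e7, 18185017, 2.8e7)

# Intervals are closed on the right by default:
# (-Inf, 7552015] -> low, (7552015, 18185017] -> medium, above -> high
grp <- cut(subs_max,
           breaks = c(-Inf, 7552015, 18185017, Inf),
           labels = c("low", "medium", "high"),
           ordered_result = TRUE)

grp
as.numeric(grp)
```

Because the result is already an ordered factor, as.numeric() maps low, medium and high to 1, 2 and 3 without a separate reordering step.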

## YtUsa$subscriber_by_category: low
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      559    33768   144773   521240   472936 25244097 
## -------------------------------------------------------- 
## YtUsa$subscriber_by_category: medium
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      773    83726   287705   925453   808356 56111957 
## -------------------------------------------------------- 
## YtUsa$subscriber_by_category: high
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       704    125034    406864   1676927   1315056 149376127

It can be seen that the 1st quartile, median and 3rd quartile of the view counts all increase from low to medium to high ‘subscriber_by_category’.

## [1] 0.4602942

Likes and dislikes have a moderate positive relationship, with a correlation of 0.46.

## 
##  Pearson's product-moment correlation
## 
## data:  views and comment_count
## t = 47.182, df = 4545, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5535431 0.5925758
## sample estimates:
##       cor 
## 0.5733848

The Pearson correlation between views and comment_count is moderately strong; its value is 0.5733848.

## 
##  Spearman's rank correlation rho
## 
## data:  views and comment_count
## S = 2738300000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8252309

Spearman's rho is a nonparametric measure of rank correlation. The rho between views and comment_count is stronger still: 0.8252309.

The regression line for views vs. trend_tag_highest is monotonic here.
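Pearson's coefficient measures linear association, while Spearman's rho only looks at the ranks, so a monotonic but non-linear relationship scores higher with Spearman. A sketch with made-up data illustrates the gap:

```r
x <- 1:10
y <- 10^x   # increases with x, but far from linearly

# Pearson is pulled down by the non-linearity
pearson <- cor(x, y, method = "pearson")

# Spearman depends only on the ordering, which is perfect here
spearman <- cor(x, y, method = "spearman")

pearson
spearman
```

A similar effect may explain why rho exceeds the Pearson value for heavily skewed counts such as views and comment_count.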

## [1] -0.01307464

And correlation between views & trend_tag_highest is -0.013 here.

## [1] -0.02185687

The relationship between views and trend_tag_total is non-linear.

Boxplots of views vs. trend_day_count, with the plots limited to the two-tailed 95% range:

## YtUsa$trend_day_count: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     748   41278  163307  426914  466968 9632678 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 2
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      704    47412   185620   510526   540362 14161833 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 3
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      559    34282   146532   668972   666358 43449654 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 4
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      988    55959   204197   828470   608572 21582276 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 5
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1464    79040   268343   993644   853708 26448434 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 6
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1949   149819   372212   928398   866277 41088994 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 7
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     4318   271626   667336  2147018  1641607 57951412 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 8
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      3170    235848    639536   2611971   2025050 149376127 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 9
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     6688   448169  1115398  3050553  2448968 91933007 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 10
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     24296    479238   1520884   7000378   4553615 102012605 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 11
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    16425   188322   485534  2535995  1891032 34269048 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   68886  345371 1046764 1465997 2090906 7721222 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 13
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   451602  1685408  2888654  7769948  5305317 45938392 
## -------------------------------------------------------- 
## YtUsa$trend_day_count: 14
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 17540613 17540613 17540613 17540613 17540613 17540613

The median view count for trend_day_count = 1 is 163,307, while for trend_day_count = 14 (a single video) it is 17,540,613, which is far larger. So the more times a video is listed on YouTube trending, the higher its median view count tends to be (in most cases).
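Group medians like these can be pulled out directly with tapply(), and it helps to show the group sizes alongside them, since a median based on one video carries little weight. A sketch with a made-up data frame `df`:

```r
# Made-up data: trending day count and views per video
df <- data.frame(
  trend_day_count = c(1, 1, 1, 2, 2, 3),
  views           = c(100, 200, 900, 400, 600, 5000)
)

# Median views within each trend_day_count group
med <- tapply(df$views, df$trend_day_count, median)

# Number of videos behind each median
n <- table(df$trend_day_count)

med
n
```

In this toy example the last group contains a single video, so its median is simply that video's view count.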

## subscriber_by_category: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     230    1011    5315    5930   97030 
## -------------------------------------------------------- 
## subscriber_by_category: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2264    8843   27730   25045 1988746 
## -------------------------------------------------------- 
## subscriber_by_category: high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2015    9124   55122   33120 3093544

The median likes count for the low group of ‘subscriber_by_category’ (1011) is far lower than the median likes count for the medium (8843) and high (9124) groups.

## subscriber_by_category: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0  163289  599310 1132174 1685948 7552015       2 
## -------------------------------------------------------- 
## subscriber_by_category: medium
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0   214706  1099202  2305559  3008137 18185017        9 
## -------------------------------------------------------- 
## subscriber_by_category: high
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        0   335046  1759496  4238331  5292034 28676937       11

Median subscriber count for subscriber_by_category: low = 599310, medium = 1099202, high = 1759496.

Boxplots of views v/s tag_appeared_in_title (2-tailed 95% CI) are shown below:

## YtUsa$tag_appeared_in_title: FALSE
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      704    53227   247049  1167267   795873 56111957 
## -------------------------------------------------------- 
## YtUsa$tag_appeared_in_title: TRUE
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       559     97816    333984   1283600   1042283 149376127

Median views for tag_appeared_in_title = FALSE: 247049. Median views for tag_appeared_in_title = TRUE: 333984.

From the above observations, we can say that whether a tag appears in the title seems to have an impact on the views count of Youtube trending videos.

The above plot shows an obvious point: if tag_appeared_in_title is FALSE, then tag_appeared_in_title_count must be 0, since one variable is derived from the other.
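The relationship between the two tag variables can be checked directly. The sketch below uses the first six values of each column as printed in the str() output at the top of the report; the derivation rule (flag is TRUE exactly when the count is positive) is an assumption, since the code that created the columns is not shown.

```r
# First six values of both columns, copied from the str() output.
tag_count <- c(0, 0, 8, 3, 1, 2)
tag_flag  <- c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Assumed derivation rule: the flag is TRUE exactly when the count is positive.
derived_flag <- tag_count > 0

# TRUE when the derived flag agrees with the stored flag on every row.
consistent <- identical(derived_flag, tag_flag)
print(consistent)
```

On the full data frame, the same comparison would presumably be run on the whole columns rather than on six values.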

For a trending Youtube video, if the difference between its first trending date & its publish date is less than 4 days, then it does not re-trend more than 3 times on Youtube.

## YtUsa$comments_disabled: False
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       559     92536    320792   1258650   1010413 149376127 
## -------------------------------------------------------- 
## YtUsa$comments_disabled: True
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      748    23922   149300  1678329   862810 56111957

Although only 76 of the 4547 observations have comments_disabled = True, the median views count is still much lower for those videos (149300 v/s 320792), so comments_disabled seems to have a large impact on the views count of Youtube trending videos.

It looks like the ratio of likes to dislikes is not uniform across Youtube trending videos.

It looks like the most frequently used tags are attached to videos from category_id 23 & 24.

A significant number of the videos that re-trend every day come from the subscriber_by_category: high & subscriber_by_category: medium groups.

## YtUsa$tag_appeared_in_title: FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.655   7.000  13.000 
## -------------------------------------------------------- 
## YtUsa$tag_appeared_in_title: TRUE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.863   7.000  14.000

It looks like the number of days a video trended is not much affected by whether a tag appeared in the video title.

## [1] 0.003732712

Just by observing the number of tags (tags_count) attached to a trending video, we cannot say how many views that video would get.

There are a good number of outliers in the dataset. These outliers have only a few subscribers and yet they managed to get a high views count.

Applying geom_smooth() with and without the linear method (lm), it looks like the average views count increases as the average number of subscribers increases, though the relationship is not strong. [Please note: we observed the data for the 2-tailed 95% CI.]

## [1] 0.2657179

As expected, the correlation coefficient between views & subscriber is 0.27.
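Because subscriber contains missing values (the summaries above report NA's), a plain cor() call would return NA. The sketch below shows one way to handle this with the `use` argument; the toy data frame and its values are assumptions, not rows from YtUsa.

```r
# Toy stand-in for YtUsa with one missing subscriber value (invented numbers).
toy <- data.frame(
  views      = c(1200, 560, 98000, 4300, 150000, 720),
  subscriber = c(300, NA, 45000, 9000, 20000, 150)
)

# Without NA handling the result is NA.
r_naive <- cor(toy$views, toy$subscriber)

# Drop incomplete pairs before computing the Pearson coefficient.
r <- cor(toy$views, toy$subscriber, use = "complete.obs", method = "pearson")

print(r_naive)
print(r)
```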

Now I am building a linear model for views v/s likes:

## 
## Call:
## lm(formula = I(views) ~ I(likes), data = subset(YtUsa, !is.na(subscriber) & 
##     !is.na(tags)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -47329618   -245824   -185412     31998  68253405 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.275e+05  4.100e+04    5.55 3.03e-08 ***
## I(likes)    2.615e+01  2.719e-01   96.17  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2596000 on 4321 degrees of freedom
## Multiple R-squared:  0.6816, Adjusted R-squared:  0.6815 
## F-statistic:  9248 on 1 and 4321 DF,  p-value: < 2.2e-16

For the linear model, the Multiple R-squared value for views v/s likes = 0.6816
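For a regression with a single predictor, the Multiple R-squared equals the square of the Pearson correlation between the two variables, which ties this model output to the correlation values reported elsewhere in the report. A minimal sketch on invented numbers (the vectors below are assumptions, not YtUsa data):

```r
# Invented likes/views pairs for illustration only.
likes <- c(10, 50, 200, 800, 1500, 4000, 9000)
views <- 25 * likes + c(300, -200, 500, -1000, 2000, -1500, 800)

# Simple linear regression of views on likes.
fit <- lm(views ~ likes)

# R-squared from the model and the squared Pearson correlation.
r2 <- summary(fit)$r.squared
r  <- cor(views, likes)

# The two quantities agree up to floating-point error.
print(all.equal(r2, r^2))
```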

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The views count of Youtube trending videos is strongly correlated with likes, dislikes and comment_count. As likes, dislikes or comment_count increase, the views of a video also increase. The relationship between views count and likes, dislikes or comment_count is almost linear.

Pearson correlation coefficient (method = pearson) between views & likes = 0.82

Pearson correlation coefficient (method = pearson) between views & dislikes = 0.53

Pearson correlation coefficient (method = pearson) between views & comment_count = 0.57

And the rank correlation (method = spearman) between views & comment_count = 0.8252309
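The gap between the Pearson (0.57) and Spearman (0.83) values for views & comment_count is what one would expect from a relationship that is monotonic but not a straight line. The sketch below illustrates the difference on an invented, perfectly monotonic but curved pair of vectors; the data are an assumption chosen only to make the contrast visible.

```r
# A perfectly monotonic but strongly curved relationship (invented data).
x <- 1:10
y <- exp(x)

# Pearson measures linear association; Spearman measures monotonic association.
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman is the Pearson correlation computed on the ranks.
rank_pearson <- cor(rank(x), rank(y), method = "pearson")

print(c(pearson = pearson, spearman = spearman, rank_pearson = rank_pearson))
```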

Note that all of the above correlations are positive and moderate to strong.

Pearson correlation coefficient (method = pearson) between views & subscriber = 0.27

For tag_appeared_in_title: TRUE , median views count is 333984. For tag_appeared_in_title: FALSE , median views count is 247049.

Median subscriber count for subscriber_by_category: low = 599310, medium = 1099202, high = 1759496.

The most frequently used tags are attached to videos from category_id 23 & 24.

Based on the R^2 value, likes explains about 68 percent of the variance in views. Other features of interest can be incorporated into the model to explain more of the variance in views.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. The Pearson correlation coefficient (PCC) between views & category_id is -0.12, between views & tag_appeared_in_title is 0.01, between views & trend_tag_highest is -0.01, between views & trend_tag_total is -0.02, between views & trend_day_count is 0.19, and between views & subscriber_by_category is 0.10.

The 1st quartile, median & 3rd quartile of the views count are strongly affected by the ‘subscriber_by_category’ feature.

The relationship between views count and category_id, trend_tag_highest or trend_day_count is monotonic. The relationship between views count and trend_tag_total is non-linear.

What was the strongest relationship you found?

The strongest relationship between two features in the dataset is between views & likes, with a correlation coefficient of 0.82.

Multivariate Plots Section

The trend shows that the variance of views per day gets bigger for groups with a higher number of subscribers.

Just by seeing how many times a video appeared on the Youtube trending list, we cannot tell how many views it will get, because the variance of views for each day is very big, so the range is very wide here.

## subscriber_by_category: low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     186    9957   36099  126459  124612 5048819 
## -------------------------------------------------------- 
## subscriber_by_category: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     247   21370   62628  182627  173469 6769490 
## -------------------------------------------------------- 
## subscriber_by_category: high
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      248    31536   107318   357914   328971 18672016

From the above observation we can say that subscriber_by_category: high has the highest median views per day (107318).

## tag_appeared_in_title: FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     247   14459   55604  246723  198642 6769490 
## -------------------------------------------------------- 
## tag_appeared_in_title: TRUE
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      186    25111    82070   270246   237691 18672016

The median views per day for tag_appeared_in_title : True is higher than the median views per day for tag_appeared_in_title : False.

On a trending video, comments_disabled or ratings_disabled has an impact on the number of views per day.

## [1] 0.04160659

The correlation between (views/trend_day_count) & trend_tag_highest is 0.042, which is slightly larger in magnitude than the correlation between views & trend_tag_highest (-0.013) calculated in the Bivariate Plots Section. Both values are very close to 0.
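The views-per-day quantity used in this section is the total views divided by the number of trending days. A minimal sketch of that derivation and of the correlation call follows; the toy data frame, the column name `views_per_day` and all values are assumptions, since the original code is not shown.

```r
# Toy stand-in for YtUsa; all values are invented for illustration only.
toy <- data.frame(
  views             = c(1000, 9000, 400, 70000),
  trend_day_count   = c(1, 3, 2, 7),
  trend_tag_highest = c(2, 5, 1, 3)
)

# Average views per trending day (hypothetical column name).
toy$views_per_day <- toy$views / toy$trend_day_count

# Correlation of the derived quantity with trend_tag_highest.
r_per_day <- cor(toy$views_per_day, toy$trend_tag_highest)

print(toy$views_per_day)
print(r_per_day)
```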

The above plot represents the top 95% of the data (views v/s likes), coloured by trend_day_count.

The above plot represents the top 95% of the data (views v/s likes), coloured by the 3 subscriber_by_category groups.

Across the plot we can see a clear dominance of the medium & high groups among Youtube trending videos with high views counts & high likes.

The above plot shows that tag_appeared_in_title has an impact on views v/s the top 95% of trend.publish.diff.

The above plot represents the top 95% of views v/s subscriber_by_category, coloured by trend_day_count.

The above plot represents the top 95% of views per day v/s subscriber_by_category, coloured by whether tag_appeared_in_title is true or not.

Now I am plotting the same plot, but instead of views per day, this time I am using the total views of a video.

The last 2 plots look similar, but take a close look at the y-axis (views) scale. Its values are much bigger than in the previous plot, because it represents the total views of a video_id, not the average views per day.

Taking a closer look at the size of the bubbles, we can observe that trending videos that were listed more than 5 times got the highest number of views.

There are multiple outliers in the plot: many trending videos got a huge number of views although their subscriber counts are very low. Let’s focus on 0 to 1000 subscribers and apply facet_wrap for the comments_disabled & ratings_disabled variables.

So we can see that, apart from videos with ratings_disabled or comments_disabled set to True, there are many other outliers with subscriber = 0 and a huge number of views. I don’t know the real reason behind this. There might be some lurking variables causing it.

## [1] 0.2693141

PCC between views & subscriber for tag_appeared_in_title : True = 0.27

## [1] 0.2444701

PCC between views & subscriber for tag_appeared_in_title : False = 0.24

From the last plot & the above observations, it could be said that the correlation between views & subscriber is slightly stronger when a tag appears in the title than when it does not.
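The two group-wise coefficients can be obtained in a single pass by splitting the data on tag_appeared_in_title. This is a sketch on an invented toy data frame (all values are assumptions); it is one possible way to do it, not necessarily how the numbers above were produced.

```r
# Toy stand-in for YtUsa; values are invented for illustration only.
toy <- data.frame(
  views      = c(100, 400, 900, 1500, 200, 800, 300, 2500),
  subscriber = c(10, 50, 70, 200, 40, 30, 90, 400),
  tag_appeared_in_title = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Pearson correlation between views and subscriber within each group.
cor_by_group <- sapply(
  split(toy, toy$tag_appeared_in_title),
  function(d) cor(d$views, d$subscriber, use = "complete.obs")
)

print(cor_by_group)
```

With a difference as small as 0.27 v/s 0.24, it may be worth also looking at the group sizes before reading much into the gap.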

In other words, if the title of a trending video does not contain any of its tags, its subscriber & views counts might also be affected (lower).

From the above plot (I considered the 2-tailed 95% CI) we can see that there are a few categories where the variance of views per day & subscriber is very big. The categories are: 10, 23, 24.

We can also say that the video channel with the highest number of subscribers among Youtube trending videos belongs to category_id: 23.

From the above plot (I considered the 2-tailed 95% CI) we can say that videos in the categories with the highest subscriber levels use at least one of their tags in the trending video title.

Let’s print out the model table:

## 
## Calls:
## m1: lm(formula = I(views) ~ I(likes), data = subset(YtUsa, !is.na(subscriber) & 
##     !is.na(tags)))
## m2: lm(formula = I(views) ~ I(likes) + comment_count, data = subset(YtUsa, 
##     !is.na(subscriber) & !is.na(tags)))
## m3: lm(formula = I(views) ~ I(likes) + comment_count + dislikes, 
##     data = subset(YtUsa, !is.na(subscriber) & !is.na(tags)))
## m4: lm(formula = I(views) ~ I(likes) + comment_count + dislikes + 
##     trend_day_count, data = subset(YtUsa, !is.na(subscriber) & 
##     !is.na(tags)))
## m5: lm(formula = I(views) ~ I(likes) + comment_count + dislikes + 
##     trend_day_count + category_id, data = subset(YtUsa, !is.na(subscriber) & 
##     !is.na(tags)))
## m6: lm(formula = I(views) ~ I(likes) + comment_count + dislikes + 
##     trend_day_count + category_id + tag_appeared_in_title, data = subset(YtUsa, 
##     !is.na(subscriber) & !is.na(tags)))
## m7: lm(formula = I(views) ~ I(likes) + comment_count + dislikes + 
##     trend_day_count + category_id + tag_appeared_in_title + subscriber, 
##     data = subset(YtUsa, !is.na(subscriber) & !is.na(tags)))
## 
## ===============================================================================================================================================================================================================
##                                     m1                        m2                        m3                        m4                        m5                        m6                        m7             
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                       227538.928***             223254.443***             272224.491***            -140074.259*               453376.011**              680180.046***             682721.882***  
##                                     (40997.972)               (41041.876)               (33945.316)               (68312.277)              (162477.320)              (185747.621)              (186304.987)    
##   I(likes)                              26.150***                 26.694***                 31.970***                 31.626***                 32.124***                 32.148***                 32.165***  
##                                         (0.272)                   (0.388)                   (0.341)                   (0.343)                   (0.350)                   (0.350)                   (0.363)    
##   comment_count                                                   -3.481*                  -94.436***                -93.792***                -94.771***                -94.994***                -95.023***  
##                                                                   (1.766)                   (2.502)                   (2.491)                   (2.486)                   (2.486)                   (2.491)    
##   dislikes                                                                                  75.130***                 74.996***                 74.928***                 74.994***                 75.006***  
##                                                                                             (1.679)                   (1.670)                   (1.656)                   (1.656)                   (1.657)    
##   trend_day_count                                                                                                  87385.740***              93747.348***              94321.998***              94142.424***  
##                                                                                                                   (12586.797)               (12657.862)               (12652.105)               (12692.833)    
##   category_id: 2                                                                                                                            343339.268                339255.564                337885.111     
##                                                                                                                                            (300565.657)              (300384.327)              (300514.672)    
##   category_id: 10                                                                                                                         -1150715.770***           -1138430.648***           -1137304.134***  
##                                                                                                                                            (173793.532)              (173754.849)              (173887.165)    
##   category_id: 15                                                                                                                          -672396.419**             -672963.888**             -672860.020**   
##                                                                                                                                            (250231.202)              (250076.686)              (250105.481)    
##   category_id: 17                                                                                                                          -330136.845               -320163.924               -316810.857     
##                                                                                                                                            (192873.103)              (192794.727)              (193715.043)    
##   category_id: 19                                                                                                                          -580428.071               -586212.922               -587315.148     
##                                                                                                                                            (344618.305)              (344413.049)              (344506.268)    
##   category_id: 20                                                                                                                          -579426.034               -606973.159               -605818.068     
##                                                                                                                                            (330401.999)              (330379.534)              (330479.067)    
##   category_id: 22                                                                                                                          -764477.273***            -778314.746***            -777219.480***  
##                                                                                                                                            (191293.348)              (191254.337)              (191372.705)    
##   category_id: 23                                                                                                                         -1022817.922***           -1032406.476***           -1027304.877***  
##                                                                                                                                            (183815.371)              (183741.367)              (185936.967)    
##   category_id: 24                                                                                                                          -440257.056**             -438662.353**             -436462.550**   
##                                                                                                                                            (161040.744)              (160942.486)              (161424.329)    
##   category_id: 25                                                                                                                          -609647.157***            -628134.362***            -628862.425***  
##                                                                                                                                            (180841.552)              (180879.298)              (180944.902)    
##   category_id: 26                                                                                                                          -657792.027***            -651797.293***            -651178.657***  
##                                                                                                                                            (180466.315)              (180370.562)              (180423.620)    
##   category_id: 27                                                                                                                          -867162.101***            -877987.638***            -876734.028***  
##                                                                                                                                            (219525.715)              (219432.309)              (219567.582)    
##   category_id: 28                                                                                                                          -589269.122**             -590792.585**             -590813.534**   
##                                                                                                                                            (195438.361)              (195318.539)              (195340.544)    
##   category_id: 29                                                                                                                         -2281129.093***           -2311105.192***           -2313453.198***  
##                                                                                                                                            (631008.269)              (630731.041)              (630936.980)    
##   category_id: 43                                                                                                                         -1078145.794              -1052072.679              -1051764.831     
##                                                                                                                                           (1501807.627)             (1500915.476)             (1501085.278)    
##   tag_appeared_in_title                                                                                                                                              -257303.258*              -256639.079*    
##                                                                                                                                                                      (102328.747)              (102406.820)    
##   subscriber                                                                                                                                                                                        -0.001     
##                                                                                                                                                                                                     (0.007)    
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                              0.682                     0.682                     0.783                     0.785                     0.790                     0.790                     0.790     
##   adj. R-squared                         0.681                     0.682                     0.782                     0.785                     0.789                     0.789                     0.789     
##   sigma                            2596319.808               2595453.197               2145557.086               2133928.400               2113155.980               2111850.258               2112087.803     
##   F                                   9248.460                  4629.261                  5183.681                  3942.298                   851.634                   810.369                   771.608     
##   p                                      0.000                     0.000                     0.000                     0.000                     0.000                     0.000                     0.000     
##   Log-likelihood                    -69982.076                -69980.132                -69156.697                -69132.703                -69082.893                -69079.719                -69079.703     
##   Deviance               29127327557805164.000     29101149937090996.000     19882150280982512.000     19662662495539220.000     19214737528568204.000     19186539321071680.000     19186394929287012.000     
##   AIC                               139970.152                139968.265                138323.395                138277.406                138207.787                138203.438                138205.405     
##   BIC                               139989.267                139993.751                138355.253                138315.636                138341.593                138343.616                138351.955     
##   N                                   4323                      4323                      4323                      4323                      4323                      4323                      4323         
## ===============================================================================================================================================================================================================

Now summary of final model :-

## 
## Call:
## lm(formula = I(views) ~ I(likes) + comment_count + dislikes + 
##     trend_day_count + category_id + tag_appeared_in_title + subscriber, 
##     data = subset(YtUsa, !is.na(subscriber) & !is.na(tags)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -22887876   -440250   -116257    195134  55268218 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                6.827e+05  1.863e+05   3.665 0.000251 ***
## I(likes)                   3.217e+01  3.629e-01  88.627  < 2e-16 ***
## comment_count             -9.502e+01  2.491e+00 -38.141  < 2e-16 ***
## dislikes                   7.501e+01  1.657e+00  45.263  < 2e-16 ***
## trend_day_count            9.414e+04  1.269e+04   7.417 1.44e-13 ***
## category_id2               3.379e+05  3.005e+05   1.124 0.260925    
## category_id10             -1.137e+06  1.739e+05  -6.540 6.85e-11 ***
## category_id15             -6.729e+05  2.501e+05  -2.690 0.007166 ** 
## category_id17             -3.168e+05  1.937e+05  -1.635 0.102028    
## category_id19             -5.873e+05  3.445e+05  -1.705 0.088303 .  
## category_id20             -6.058e+05  3.305e+05  -1.833 0.066849 .  
## category_id22             -7.772e+05  1.914e+05  -4.061 4.97e-05 ***
## category_id23             -1.027e+06  1.859e+05  -5.525 3.49e-08 ***
## category_id24             -4.365e+05  1.614e+05  -2.704 0.006882 ** 
## category_id25             -6.289e+05  1.809e+05  -3.475 0.000515 ***
## category_id26             -6.512e+05  1.804e+05  -3.609 0.000311 ***
## category_id27             -8.767e+05  2.196e+05  -3.993 6.63e-05 ***
## category_id28             -5.908e+05  1.953e+05  -3.025 0.002505 ** 
## category_id29             -2.313e+06  6.309e+05  -3.667 0.000249 ***
## category_id43             -1.052e+06  1.501e+06  -0.701 0.483547    
## tag_appeared_in_titleTRUE -2.566e+05  1.024e+05  -2.506 0.012245 *  
## subscriber                -1.301e-03  7.232e-03  -0.180 0.857230    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2112000 on 4301 degrees of freedom
## Multiple R-squared:  0.7902, Adjusted R-squared:  0.7892 
## F-statistic: 771.6 on 21 and 4301 DF,  p-value: < 2.2e-16

The final model explains only about 79 percent of the variance in views (R^2 = 0.79). Residual standard error: 2112000 on 4301 degrees of freedom.
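In the model table, m6 and m7 share the same R-squared (0.790), m7 has a slightly higher AIC (138205.405 v/s 138203.438), and the subscriber coefficient has p = 0.857, which suggests the last predictor adds little. The sketch below shows how such nested models can be compared; the simulated data and the predictor name `useless` are assumptions made purely for illustration.

```r
# Simulated data: views depend on likes only; `useless` is pure noise.
set.seed(42)
n <- 200
likes   <- rexp(n, rate = 1 / 1000)
useless <- rnorm(n)
views   <- 30 * likes + rnorm(n, sd = 5000)

# Nested models: the second adds the noise predictor.
m_small <- lm(views ~ likes)
m_big   <- lm(views ~ likes + useless)

# R-squared never decreases when a predictor is added to a nested model.
r2_small <- summary(m_small)$r.squared
r2_big   <- summary(m_big)$r.squared
print(c(small = r2_small, big = r2_big))

# AIC and the F-test penalise or test the extra term.
print(AIC(m_small, m_big))
print(anova(m_small, m_big))
```

Because R-squared cannot go down when terms are added, adjusted R-squared, AIC or the F-test are usually more informative when deciding whether a predictor is worth keeping.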

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I observed that just by seeing how many times a video appeared on the Youtube trending list, we cannot tell how many views it will get, because the variance of views for each day is very big.

Median views per day for subscriber_by_category: low = 36099 . Median views per day for subscriber_by_category: medium = 62628 . Median views per day for subscriber_by_category: high = 107318 .

Median views per day count for tag_appeared_in_title : True is 82070 . Median views per day count for tag_appeared_in_title : False is 55604 .

The video channel with the highest number of subscribers among Youtube trending videos belongs to category_id = 23.

From the top 95% of data points for views per day & subscriber, we observed that there are a few categories where the variance of views & subscriber is very big. The categories are: 10, 23, 24.

From the plots, I saw that many videos have 0 likes but a high number of views, and many videos have zero (0) subscribers but still managed to get a high number of viewers; those are probably the outliers of the dataset. To isolate them, I focused on 0 to 1000 subscribers and applied facet_wrap for the comments_disabled & ratings_disabled variables. Surprisingly, comments_disabled=TRUE & ratings_disabled=TRUE could not account for all of the outliers. There must be something else going on; maybe some lurking variables are causing this behaviour.

Trending videos that were listed more than 5 times (or days) got the highest views counts. The Pearson correlation coefficient (PCC) between views & subscriber is stronger when one of the video's tags appeared in the title. The variance of views per day gets bigger for groups with more subscribers (low < medium < high for subscriber_by_category).

Were there any interesting or surprising interactions between features?

Yes. Many of the trending videos have a low number of subscribers and yet they managed to get more viewers than top subscriber channels. I also saw many trending videos that got high views counts but have very few likes; many of them have 0 likes. I think some of the 0-like videos are ones where ratings_disabled is set to True.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes. For the final model (m7), I got Multiple R-squared: 0.7902 & Residual standard error: 2112000. So the final model explains only about 79 percent of the variance in views. Its residual standard error is also very big, roughly 2.1 million views, which would give a very wide interval around any prediction (on the order of ±4 million views at the 95% level). Therefore this model would not be able to calculate/predict views counts accurately.

The model I am looking for is not just for predicting the views count of a trending video; I also want to flag a video if it falls outside the 95% CI.
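One way such flagging could work is to compute a 95% prediction interval for each video and mark those whose observed views fall outside it. The sketch below does this for a single-predictor model on simulated data; the data, the column name `flagged` and the choice of a prediction interval (rather than a confidence interval for the mean) are all assumptions.

```r
# Simulated likes/views data for illustration only.
set.seed(7)
n <- 100
likes <- round(rexp(n, rate = 1 / 2000))
views <- 30 * likes + rnorm(n, sd = 10000)
toy <- data.frame(likes = likes, views = views)

# Fit the model and compute a 95% prediction interval for every video.
fit  <- lm(views ~ likes, data = toy)
pi95 <- predict(fit, newdata = toy, interval = "prediction", level = 0.95)

# Flag videos whose observed views lie outside their interval.
toy$flagged <- toy$views < pi95[, "lwr"] | toy$views > pi95[, "upr"]

print(table(toy$flagged))
```

With a residual standard error as large as the one reported above, these intervals would be very wide, so only extreme videos would be flagged.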


Final Plots and Summary

Plot One

Description One

From the above plot, we can see that no video trended for more than 14 days. More than 600 videos appeared in the Youtube trending video list only once (that was their first & last time).

Plot Two

Description Two

Since log10 is applied on the x-axis and a few videos in the Youtube trending list have 0 likes, we have to pass (likes+1) instead of likes into the scale_x_log10() function. That avoids infinite values (since log10(0) = -Inf). Therefore, on the x-axis of the above plot, x=1 represents 0 likes, x=100 represents 99 likes, and so on.
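The effect of the +1 shift can be seen on a few hand-picked like counts. This is a small sketch in base R; the vector `likes` is an invented example, not data from YtUsa.

```r
# Hand-picked like counts, including a video with 0 likes.
likes <- c(0, 9, 99, 999)

# Taking log10 directly sends the 0-like video to -Inf.
raw_log <- log10(likes)

# Shifting by 1 keeps every value finite.
shifted_log <- log10(likes + 1)

print(raw_log)
print(shifted_log)
```

The shifted values are what the x-axis positions correspond to, so the tick labelled 1 holds the videos with 0 likes.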

I have applied two smoother lines on the above plot ,one with linear method (red line) & another without linear method(blue line). Here smoother line(Slope Of Regression Line) represents the slope of the line of best fit in the scatterplot. Since there is very strong relationship between views & likes attribute (cor=0.82),hence the slope of the linear line nearer to 1.

We can see many outliers on the y-axis at x = 1. Many of those video authors might have disabled video ratings, so users are not able to like or dislike the video. Those outliers cause the non-linear smoother line to start from (x=1, y=10000).

Plot Three

Description Three

Note - I took a subset of the dataset to filter out rows where subscriber is NA.

After considering only the top 95% of viewers and subscribers, we can see that, in every group, trending videos that do not include any of their tags in the video title tend to have fewer subscribers and viewers than videos that include at least one.

Trending videos under subscriber_by_category: high and subscriber_by_category: medium achieved the highest view counts on Youtube.

From the bubble size, we can say that trending videos that trended more than 3 to 4 times (or days) achieved the highest view counts on Youtube.

From the linear regression line, we can say that, except for videos under subscriber_by_category: low, the average number of viewers increases as the average number of subscribers increases.


Reflection

The Youtube Trending data set contains information on almost 4,600 unique videos across 23 variables, recorded over a total of 105 days. [I have also created one extra categorical variable.]

I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the views of videos across many variables. The view count of a video is strongly correlated with likes, dislikes, and comment_count. The highest correlation is between the views and likes variables, where cor=0.82. Some other important features (subscriber, trend_day_count, category_id) also have good correlation values. That is why I included all of the above features in the predictive model. The model was able to account for only 79% of the variance in views, and its Residual standard error was large. So the model would not be very useful for accurately predicting the view count of a Youtube trending video.

I may not have been able to build a good predictive model yet, but I found some interesting facts about Youtube trending videos:

84% or more of trending videos use at least one of their tags in the video title.

Apart from 604 trending videos, all trending videos appeared in the trending list more than once.

Most Youtube videos are listed on trending within 0 to 14 days of their publish date.

Users engaged in conversation more when they disliked a trending video than when they liked it.

If the difference between the first trending date and the publish date is less than 4 days, there is a big chance that the video will not trend more than 3 times.

Whether or not a tag appears in the title (tag_appeared_in_title) has an impact on the view count of a trending video.

Trending videos that were listed more than 5 times got the highest number of views.

Videos in the categories with the most subscribers use at least one of their tags in the trending video title.

Struggle:-

I struggled to create these two variables: trend_tag_highest and trend_tag_total.

I failed to create them directly from the dataset; this may be because I was new to R programming.

But by creating a separate (temporary) dataframe from the ‘tags’ variable and using it as a lookup, I succeeded in creating the new variables trend_tag_highest and trend_tag_total. The trick was to build a data structure in R similar to a Python dictionary. I believe there should be an easier technique to achieve the same goal.
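A dictionary-like lookup of this kind can be sketched with a named frequency table. The example below is not the code used in this report: the miniature `videos` data frame is hypothetical, and it assumes that trend_tag_total is the sum of the dataset-wide frequencies of a video's tags and trend_tag_highest is the largest of those frequencies.

```r
# Hypothetical miniature of the 'tags' column (pipe-separated, as in the dataset).
videos <- data.frame(
  video_id = c("a", "b", "c"),
  tags = c("music|live|pop", "music|news", "news"),
  stringsAsFactors = FALSE
)

# Split each tag string into a character vector of tags.
tag_list <- strsplit(videos$tags, "|", fixed = TRUE)

# A named table plays the role of a Python dictionary: tag -> frequency.
tag_freq <- table(unlist(tag_list))

# Look up each video's tags in the table (assumed definitions, see the lead-in).
videos$trend_tag_total   <- sapply(tag_list, function(t) sum(tag_freq[t]))
videos$trend_tag_highest <- sapply(tag_list, function(t) max(tag_freq[t]))

videos
```

Indexing the table by tag name avoids the temporary dataframe, because `table()` already stores the counts under their tag names.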

Surprise:-

Many Youtube trending videos were listed on the trending list for more than 1 time (or day), but they did not get high traffic.

As I already discussed, many trending videos come from channels with few subscribers (some of them have 0), yet they attracted more viewers than videos from the top-subscribed channels on Youtube. I also saw many trending videos with high view counts but very few likes (many of them have 0).

Future work :-

In the future, we could use an OCR (optical character recognition) technique to scan the thumbnail image and check whether the thumbnail uses any text.

We can make a new variable called ‘tag_appeared_in_description’ with the help of the ‘tags’ and ‘description’ variables; it would be very similar to the existing variable ‘tag_appeared_in_title’. With a similar approach, we can make another variable called ‘title_appeared_in_description’ with the help of the ‘title’ and ‘description’ variables.
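A minimal sketch of the first variable follows. The helper `tag_in_text` and the three-row `videos` data frame are hypothetical, and the matching rule (case-insensitive, literal substring match) is an assumption; the existing ‘tag_appeared_in_title’ variable may have been built differently.

```r
# Hypothetical miniature of the 'tags' and 'description' columns.
videos <- data.frame(
  tags = c("cats|funny", "news|politics", "diy"),
  description = c("A funny clip of my cats", "Weekly roundup", "DIY shelf build"),
  stringsAsFactors = FALSE
)

# TRUE when at least one tag occurs in the text (case-insensitive, literal match).
tag_in_text <- function(tags, text) {
  tag_vec <- strsplit(tolower(tags), "|", fixed = TRUE)[[1]]
  any(vapply(tag_vec, grepl, logical(1), x = tolower(text), fixed = TRUE))
}

# Apply the helper row by row.
videos$tag_appeared_in_description <- mapply(
  tag_in_text, videos$tags, videos$description, USE.NAMES = FALSE
)

videos$tag_appeared_in_description
```

The same helper could be reused with the title passed as `tags` and the description as `text`, provided the title is treated as a single string rather than a pipe-separated list.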

We can create some bucket variables (using the cut function) for the ‘last_trending_date’ and ‘tags_count’ variables to make real use of those 2 features.
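As a rough illustration of such bucketing, the sketch below applies `cut()` to made-up tag counts. The break points and labels are arbitrary assumptions and would need to be chosen from the real distribution of ‘tags_count’.

```r
# Made-up tag counts for six videos.
tags_count <- c(0, 3, 8, 15, 27, 42)

# Intervals are closed on the right by default: (-Inf,5], (5,15], (15,30], (30,Inf].
tags_bucket <- cut(
  tags_count,
  breaks = c(-Inf, 5, 15, 30, Inf),
  labels = c("0-5", "6-15", "16-30", "31+")
)

table(tags_bucket)
```

A date variable can be bucketed in a similar way after converting it with `as.Date()`, for example by passing `breaks = "week"` to `cut()`.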

We can group some elements together; that might help to extract hidden information from the data.

Limitation of the dataset:-

There is a latency (time delay) in the dataset, caused by scraping the subscriber column separately (subscriber data did not come with the original dataset, so it was recorded at a different time).

Possible Lurking Variables :-

There are some lurking variables that could affect the main features of the dataset (views, likes, dislikes, comment_count, subscriber). One example is the content of the thumbnail used for the trending video. An auto-generated thumbnail (created by the Youtube system) might lead to fewer views, likes, etc. On the other hand, a trending video may use a custom thumbnail that was manually uploaded by the author. If it contains a striking image or text that provokes viewers to click and check the video (in other words, clickbait), that could easily affect the trending statistics.

Alternatively, if a trending video covers the topic that was the trend of the day on the Internet, it could easily get more attention than other trending videos, irrespective of how many subscribers the channel had.

Another possibility is that the video was not well presented: its graphics, sound quality, or presentation may have been poor.

It is also possible that some dishonest users are using blackhat techniques to bypass the Youtube trending algorithm, and that Youtube's algorithms have not yet flagged them.